An unsupervised method for identifying loanwords in Korean
Abstract
This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords or transliterated foreign words in Korean text. The classifier is trained on an unlabeled corpus using the Expectation Maximization algorithm, building on seed words extracted from the corpus. Words with high token frequency serve as native seed words. Words with seeming traces of vowel insertion to repair consonant clusters serve as foreign seed words. What counts as a trace of insertion is determined using phoneme co-occurrence statistics in conjunction with ideas and findings in phonology. Experiments show that the method can produce an unsupervised classifier that performs at a level comparable to that of a supervised classifier. In a cross-validation experiment using a corpus of about 9.2 million words and a lexicon of about 71,000 words, mean F-scores of the best unsupervised classifier and the corresponding supervised classifier were 94.77% and 96.67%, respectively. Experiments also suggest that the method can be readily applied to other languages with similar phonotactics such as Japanese.
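The training loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bigram order, add-one smoothing, the choice to keep seed labels fixed, and the function names `char_ngrams`, `fit_model`, and `train_em` are all assumptions made for illustration. Seed words are passed in directly here, whereas the paper selects them by token frequency and by traces of vowel insertion. The toy strings are placeholders, not Korean data.

```python
import math


def char_ngrams(word, n=2):
    """Return character n-grams of a word, padded with start/end markers."""
    padded = "^" * (n - 1) + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_model(words, weights, n, alphabet_size):
    """Estimate an add-one smoothed n-gram model from weighted words."""
    counts, contexts = {}, {}
    for word, weight in zip(words, weights):
        for gram in char_ngrams(word, n):
            counts[gram] = counts.get(gram, 0.0) + weight
            contexts[gram[:-1]] = contexts.get(gram[:-1], 0.0) + weight

    def log_prob(word):
        return sum(
            math.log((counts.get(g, 0.0) + 1.0)
                     / (contexts.get(g[:-1], 0.0) + alphabet_size))
            for g in char_ngrams(word, n)
        )

    return log_prob


def sigmoid(x):
    """Logistic function with clamping to avoid overflow in exp."""
    x = max(-700.0, min(700.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def train_em(corpus, native_seeds, foreign_seeds, n=2, iterations=10):
    """Return P(foreign | word) for every word, estimated with EM."""
    vocab = sorted(set(corpus) | set(native_seeds) | set(foreign_seeds))
    alphabet_size = len({c for w in vocab for c in w} | {"$"})

    # Seeds are fixed at 0 or 1 (an assumption); other words start at 0.5.
    post = {w: 0.5 for w in vocab}
    post.update({w: 0.0 for w in native_seeds})
    post.update({w: 1.0 for w in foreign_seeds})

    for _ in range(iterations):
        # M-step: re-estimate both class models from the soft labels.
        foreign_lp = fit_model(vocab, [post[w] for w in vocab],
                               n, alphabet_size)
        native_lp = fit_model(vocab, [1.0 - post[w] for w in vocab],
                              n, alphabet_size)
        prior = sum(post.values()) / len(vocab)
        prior = min(max(prior, 1e-6), 1.0 - 1e-6)

        # E-step: update the posterior of every non-seed word.
        for w in vocab:
            if w in native_seeds or w in foreign_seeds:
                continue
            log_foreign = math.log(prior) + foreign_lp(w)
            log_native = math.log(1.0 - prior) + native_lp(w)
            post[w] = sigmoid(log_foreign - log_native)
    return post


# Toy usage with placeholder strings (hypothetical data).
posterior = train_em(
    corpus=["abba", "baba", "xyyx", "yxyx"],
    native_seeds={"aba", "bab", "abab"},
    foreign_seeds={"xyx", "yxy", "xyxy"},
)
is_foreign = {w: p > 0.5 for w, p in posterior.items()}
```

Keeping the seed labels fixed is one simple way to stop the two class models from drifting toward each other; whether the paper does this is not stated in the abstract.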
Similar resources
Statistical Identification of English Loanwords in Korean Using Automatically Generated Training Data
This paper describes an accurate, extensible method for automatically classifying unknown foreign words that requires minimal monolingual resources and no bilingual training data (which is often difficult to obtain for an arbitrary language pair). We use a small set of phonologically-based transliteration rules to generate a potentially unlimited amount of pseudo-data that can be used to train ...
A Nonlinear Grayscale Morphological and Unsupervised method for Human Facial Synthesis Based on an Example Image
Human facial generation from an example image is used as a requirement for biometric applications for the purpose of identifying individuals. In this paper, face generation consists of three main steps. In the first step, detection of significant lines and edges of the example image is carried out using nonlinear grayscale morphology. Then, hair areas are identified from the sample face. The fin...
Perplexity of bi-phone phonotactic models in Korean loanword phonology
The paper presents a corpus study which shows that the probability distribution of bi-phones in a lexicon of Korean loanwords is significantly different from that in a typical Korean lexicon or a lexicon consisting solely of native Korean and Sino-Korean words. This is demonstrated by comparing the perplexity of two types of bi-phone phonotactic models: a model trained on a set of Korean loanwo...
Varied adaptation patterns of English stops and fricatives in Korean loanwords: The influence of the P-map
In order to investigate to what extent perceptual factors affect the borrowing process, we examined the borrowing of English obstruents in Korean by comparing loanword adaptation patterns with the natives’ P-map (Steriade, 2001b). The orthographic classification technique was used to obtain the P-map (e.g., Wiik, 1965; Schmidt, 1996); 40 native Koreans were asked to choose the best matching Kor...
An Unsupervised Learning Method for an Attacker Agent in Robot Soccer Competitions Based on the Kohonen Neural Network
The RoboCup competition, as a great test-bed, has become a popular worldwide domain in recent years. The main objective of such competitions is to deal with the complex behavior of systems which consist of multiple autonomous agents. The rich experience of human soccer players can be used as a valuable reference for a robot soccer player. However, because of the differences between real and simulated soc...
Journal title:
- Language Resources and Evaluation
Volume 49
Publication date: 2015